
    DOMAIN-INDEPENDENT DE-DUPLICATION IN DATA CLEANING

    Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication, namely domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than existing algorithms.
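The abstract does not spell out the positional algorithm itself, but the attribute-level idea can be illustrated with a hypothetical position-aware string similarity: characters of two field values count as matching only when they are equal and close in position, so small shifts and typos are tolerated without any knowledge of what the field means. The function name, the `window` parameter, and the scoring below are assumptions for illustration, not the paper's actual method.

```python
def positional_similarity(a: str, b: str, window: int = 2) -> float:
    """Hypothetical position-aware similarity between two field values.

    A character of `a` matches a character of `b` if they are equal and
    their positions differ by at most `window`, tolerating small shifts
    caused by typos or insertions. Domain-independent: only the raw
    strings are used, never the meaning of the field.
    """
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    matched_b = [False] * len(b)  # each character of b may match once
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ch:
                matched_b[j] = True
                matches += 1
                break
    # Normalize so identical strings score 1.0 and disjoint strings 0.0.
    return 2.0 * matches / (len(a) + len(b))
```

Two field values would then be declared "equivalent" when this score exceeds some threshold, e.g. `positional_similarity("smith", "smyth")` scores 0.8.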

    Domain-independent de-duplication in data warehouse cleaning.

    Many organizations collect large amounts of data to support their business and decision-making processes. The data collected originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data-quality problems that were present in the source databases, and also acquire data-quality problems from the integration process itself. The data in the integrated systems (especially data warehouses) therefore need to be cleaned for reliable decision-support querying. A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying equivalent records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication, namely domain-independence at the attribute level and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than existing algorithms, and that the data-profiling technique effectively assigns field weights for de-duplication purposes. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2002 .U34.
Source: Masters Abstracts International, Volume: 41-04, page: 1123. Adviser: Christie I. Ezeife. Thesis (M.Sc.)--University of Windsor (Canada), 2002
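The abstract does not say which profiling statistics the thesis uses, so the distinct-value heuristic below is an assumption, meant only to sketch the record-level idea: fields whose values are highly distinctive across the data set receive larger weights, since two records agreeing on such a field is stronger evidence of duplication than agreement on a field with few distinct values. No domain knowledge about what any field means is required.

```python
def profile_field_weights(records: list[dict]) -> dict[str, float]:
    """Hypothetical data-profiling field weighting.

    Assumes a non-empty list of records sharing the same keys. Each
    field's raw weight is its distinct-value ratio (1.0 means every
    record has a unique value); weights are normalized to sum to 1 so
    they can combine per-field similarities into one record score.
    """
    n = len(records)
    raw = {}
    for field in records[0]:
        distinct = len({r[field] for r in records})
        raw[field] = distinct / n
    total = sum(raw.values())
    return {f: w / total for f, w in raw.items()}
```

A record-level score would then be the weighted sum of attribute-level similarities, e.g. `sum(w[f] * sim(r1[f], r2[f]) for f in w)`, so the highly distinctive fields dominate the duplicate decision.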

    Reduced-parameter approaches to time-sensitive discovery and analysis of navigational patterns

    Bibliography: p. 130-142. Navigational pattern discovery is the application of pattern mining techniques to navigational studies. Previous work on navigational pattern discovery relies strongly on user-specified preset parameters (notably, support thresholds). A major drawback of parameter-driven data mining is setting appropriate thresholds. In most practical scenarios, data mining is iterative, meaning that the analyst may mine at varying thresholds before arriving at satisfactory results. This usually proves expensive in parameter-driven techniques. These challenges become very pronounced in environments with continuously streaming data, where it is no longer possible to apply simple iterative data mining techniques. This thesis explores a new design goal for the discovery and analysis of navigational patterns, which aims to reduce the use of pre-defined parameters required at the beginning of the mining process (reduced-parameter mining) and to remove pre-defined parameters entirely wherever possible (parameter-free mining). The new paradigm is developed in the context of time-sensitive navigational pattern discovery. Three broad dimensions are explored in which time can significantly affect the ways navigational patterns are discovered or analyzed. The first dimension is information systems whose content continuously changes with time (thus affecting navigational behaviour). The second dimension is high-volume information systems, where the usage logs continuously change with time. The final dimension of temporal significance explored in the thesis is the representation and analysis of navigational patterns as full temporal objects (or time series).
This thesis conceptualizes the notions of time significance in navigational pattern discovery with respect to the dimensions identified above, and then proposes novel techniques, based on reduced-parameter or parameter-free principles, for discovering navigational patterns in such environments. Interestingly, the results show that reduced-parameter mining and parameter-free mining are practical concepts, and that reduced-parameter and parameter-free techniques outperform parameter-based techniques in environments that require continuous updates of patterns. This is contrary to the general belief that early introduction of pre-defined parameters always improves the mining process, and the result has strong implications for other data mining tasks.
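The thesis's own algorithms are not described in this record; as a minimal illustration of the parameter-free flavour, the sketch below ranks page-to-page transitions in click sessions by observed frequency instead of asking the analyst for a minimum-support threshold. The function and the top-k formulation are assumptions for illustration, not the thesis's actual techniques.

```python
from collections import Counter

def top_transitions(sessions: list[list[str]], k: int = 3):
    """Rank page-to-page transitions by observed frequency.

    No preset support threshold is needed: the analyst asks for the k
    strongest patterns rather than guessing a minimum-support cut-off
    up front, avoiding the expensive re-mine-at-a-new-threshold loop.
    """
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))  # consecutive page pairs
    return counts.most_common(k)
```

Because the counter can be updated incrementally as new sessions arrive, this formulation also suits the continuously streaming usage logs the abstract describes.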
